Analysis and Visualization of Netflix content

Topics

1.0 Importing packages and loading data

2.0 Cleaning the data

2.1 Checking Null Values

2.2 Visualizing Null Values

2.3 Treating Null Values

3.0 Composition and Comparison Visualisations

3.1 Visualizing Composition of Content Type (Movies & TV Shows)

3.2 Countries producing maximum content on Netflix

3.3 Country-wise Composition of Content Type

3.4 Visualizing Composition of Content Rating

3.5 Qualitative Distribution of Content Type Across Maturity Ratings

3.6 Quantitative Distribution of Content Type Across Maturity Ratings

3.7 Count of Maturity Ratings for each Content Type

3.8 Composition of Content Ratings

4.0 Evolution of Netflix Content & its Type over time

5.0 Study of Genres Correlations

6.0 Distribution of target audiences for each country

6.1 Studying the gap between release and upload of content in different countries

6.2 Comparing the Netflix content of USA & India

7.0 Word Cloud

8.0 ML Classification Model

9.0 Interpreting the results

1.0 Importing packages and loading data

Import all the packages and load the required data, which was downloaded from Kaggle.
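A minimal sketch of the loading step, assuming the Kaggle download is a CSV read with pandas. The file name in the comment and the three-row inline sample are assumptions, used here only so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the Kaggle CSV (a subset of its columns).
SAMPLE_CSV = """show_id,type,title,country,release_year,rating
81145628,Movie,Norm of the North: King Sized Adventure,"United States, India, South Korea, China",2019,TV-PG
80117401,Movie,Jandino: Whatever it Takes,United Kingdom,2016,TV-MA
70234439,TV Show,Transformers Prime,United States,2013,TV-Y7-FV
"""

# With the real download this would be something like
# pd.read_csv("netflix_titles.csv"); that file name is an assumption.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)
print(df.head())
```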

In [1]:
In [2]:
Out[2]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
... ... ... ... ... ... ... ... ... ... ... ... ...
6229 80000063 TV Show Red vs. Blue NaN Burnie Burns, Jason Saldaña, Gustavo Sorola, G... United States NaN 2015 NR 13 Seasons TV Action & Adventure, TV Comedies, TV Sci-Fi ... This parody of first-person shooter games, mil...
6230 70286564 TV Show Maron NaN Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh... United States NaN 2016 TV-MA 4 Seasons TV Comedies Marc Maron stars as Marc Maron, who interviews...
6231 80116008 Movie Little Baby Bum: Nursery Rhyme Friends NaN NaN NaN NaN 2016 NaN 60 min Movies Nursery rhymes and original music for children...
6232 70281022 TV Show A Young Doctor's Notebook and Other Stories NaN Daniel Radcliffe, Jon Hamm, Adam Godley, Chris... United Kingdom NaN 2013 TV-MA 2 Seasons British TV Shows, TV Comedies, TV Dramas Set during the Russian Revolution, this comic ...
6233 70153404 TV Show Friends NaN Jennifer Aniston, Courteney Cox, Lisa Kudrow, ... United States NaN 2003 TV-14 10 Seasons Classic & Cult TV, TV Comedies This hit sitcom follows the merry misadventure...

6234 rows × 12 columns

2.0 Cleaning the data

All the found null values will be handled below.

2.1 Checking Null Values

In [3]:
Out[3]:
show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64
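
Per-column null counts like the ones above can be produced with isnull().sum(). The sketch below uses a small made-up frame rather than the Netflix data, so the column values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [np.nan, "Someone", np.nan],
    "rating": ["TV-MA", np.nan, "TV-14"],
})

# isnull() marks each missing cell; sum() then counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```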

2.2 Visualizing Null Values

In [4]:
Out[4]:
<AxesSubplot:>

2.3 Treating Null Values

We have null values in director, cast, country, date_added and rating, so let's deal with them.

In [5]:
Out[5]:
array([2027, 1698,  701,  508,  286,  218,  184,  169,  149,  143,   95,
         37,    7,    2], dtype=int64)

We can remove the director and cast columns because they don't play a big role in how we visualise the data and add little to our analysis. Since we are only visualising this data, dropping the two columns is not a problem. This should not be done as a general rule, however: if we were developing a recommender system, the director and cast of a film would be important features for recommending movies to users, and we could not remove them.
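
Dropping the two columns could look like the sketch below. The one-row frame is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical one-row frame with the two columns we want to discard.
df = pd.DataFrame({
    "title": ["Jandino: Whatever it Takes"],
    "director": [None],
    "cast": ["Jandino Asporaat"],
    "country": ["United Kingdom"],
})

# drop(columns=...) returns a new frame without the listed columns.
df = df.drop(columns=["director", "cast"])
print(list(df.columns))
```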

In [6]:
In [7]:
Out[7]:
show_id type title country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

We replaced all of the NaN values in the country column with United States because Netflix was founded in the United States and we assume every title is available on Netflix US. So, instead of deleting the entire column, we simply filled in the missing values to keep our data.
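
A minimal sketch of this replacement, using fillna on a made-up country column:

```python
import numpy as np
import pandas as pd

# Hypothetical country column with one missing entry.
df = pd.DataFrame({"country": ["India", np.nan, "United Kingdom"]})

# Fill every missing country with the default chosen in the text above.
df["country"] = df["country"].fillna("United States")
print(df["country"].tolist())
```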

In [8]:

We already know the release year of each title, so the 11 missing date_added values won't have much impact on our analysis. As the outputs below show, the date_added column stays in the data with those values left missing.

In [9]:
Out[9]:
show_id type title country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
In [10]:
Out[10]:
TV-MA       2027
TV-14       1698
TV-PG        701
R            508
PG-13        286
NR           218
PG           184
TV-Y7        169
TV-G         149
TV-Y         143
TV-Y7-FV      95
G             37
UR             7
NC-17          2
Name: rating, dtype: int64
In [11]:
Out[11]:
Documentaries                                              299
Stand-Up Comedy                                            273
Dramas, International Movies                               248
Dramas, Independent Movies, International Movies           186
Comedies, Dramas, International Movies                     174
                                                          ... 
International Movies, Romantic Movies, Sci-Fi & Fantasy      1
Docuseries, Spanish-Language TV Shows                        1
Anime Features, International Movies                         1
TV Sci-Fi & Fantasy, TV Thrillers                            1
Action & Adventure, Classic Movies, Sci-Fi & Fantasy         1
Name: listed_in, Length: 461, dtype: int64

As we can see, our rating column has only ten missing values, which we can either drop or replace. Because TV-MA is the most common rating, we can substitute it for all of these NaN values.
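
One way to do this substitution without hard-coding the value is to look up the most frequent rating with mode(). The ratings in this sketch are a made-up sample.

```python
import numpy as np
import pandas as pd

# Hypothetical rating column with two missing entries.
df = pd.DataFrame({"rating": ["TV-MA", "TV-14", "TV-MA", np.nan, "TV-PG", np.nan]})

# mode() ignores NaN by default; its first entry is the most frequent rating.
most_common = df["rating"].mode()[0]

# Replace the missing ratings with that value.
df["rating"] = df["rating"].fillna(most_common)
print(most_common, df["rating"].isnull().sum())
```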

In [12]:
In [13]:
Out[13]:
show_id          0
type             0
title            0
country          0
date_added      11
release_year     0
rating           0
duration         0
listed_in        0
description      0
dtype: int64

Now that we've dealt with our missing data (only the 11 missing date_added values remain), let's get started on our data visualisation.

In [14]:
Out[14]:
show_id type title country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

3.0 Composition and Comparison Visualisations

In [15]:
Out[15]:
<AxesSubplot:xlabel='type', ylabel='count'>

3.1 Visualizing Composition of Content Type (Movies & TV Shows)

In [16]:
In [17]:

3.2 Countries producing maximum content on Netflix

In [18]:
Out[18]:
country
United States 2508
India 777
United Kingdom 348
Japan 176
Canada 141
South Korea 136
Spain 117
France 90
Mexico 83
Turkey 79
Australia 71

By Country

So we now know there are many more movies than TV shows on Netflix (which surprises me!).

What about if we look at content by country?

I would imagine that the USA will have the most content. I wonder how my country, the UK, will compare?

In [19]:
In [20]:
<ipython-input-20-ea4d972bb035>:28: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(data.index, fontfamily='serif', rotation=0)

As predicted, the USA dominates.

The UK is a top contender too, but still some way behind India.

How does content by country vary?

3.3 Country-wise Composition of Content Type

In [21]:
<ipython-input-21-fa62701eaf63>:16: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(data_q2q3_ratio.index, fontfamily='serif', fontsize=11)

As I've noted in the insights on the plot, it is really interesting to see how the split of TV Shows and Movies varies by country.

South Korea is dominated by TV Shows - why is this? I am a huge fan of South Korean cinema so I know they have a great movie selection.

Equally, India is dominated by Movies. I think this might be due to Bollywood - comment below if you have any other ideas!

3.4 Visualizing Composition of Content Rating

In [22]:
Out[22]:
<AxesSubplot:xlabel='rating', ylabel='count'>

3.5 Qualitative Distribution of Content Type Across Maturity Ratings

In [23]:
Out[23]:
<AxesSubplot:xlabel='rating', ylabel='type'>

3.6 Quantitative Distribution of Content Type Across Maturity Ratings

Ratings

Let's briefly check out how ratings are distributed.

In [24]:
In [25]:
<ipython-input-25-447ccea984d8>:27: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(mf.columns, fontfamily='serif')

3.7 Count of Maturity Ratings for each Content Type

In [26]:
Out[26]:
<AxesSubplot:xlabel='rating', ylabel='count'>

3.8 Composition of Content Ratings

In [27]:

4.0 Evolution of Netflix Content & its Type over time

How has content been added over the years? As we saw in the timeline at the start of this analysis, Netflix went global in 2016 - and it is extremely noticeable in this plot.

The increase in Movie content is remarkable.

In [28]:
Out[28]:
show_id type title country date_added release_year rating duration listed_in description count first_country target_ages genre month_added month_name_added year_added
0 81145628 Movie Norm of the North: King Sized Adventure United States, India, South Korea, China 2019-09-09 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra... 1 USA Older Kids [Children & Family Movies, Comedies] 9.0 September 2019.0
1 80117401 Movie Jandino: Whatever it Takes United Kingdom 2016-09-09 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra... 1 UK Adults [Stand-Up Comedy] 9.0 September 2016.0
2 70234439 TV Show Transformers Prime United States 2018-09-08 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob... 1 USA Older Kids [Kids' TV] 9.0 September 2018.0
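
The month_added, month_name_added and year_added columns in the table above can be derived from date_added. The sketch below shows one plausible way, assuming the dates are strings such as "September 9, 2019"; the explicit format string and the whitespace stripping are assumptions about the raw data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw date_added strings, including a missing one.
df = pd.DataFrame({"date_added": ["September 9, 2019", "September 8, 2018", np.nan]})

# Parse the strings into datetimes; missing or unparseable entries become NaT.
df["date_added"] = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Derive the helper columns; rows with NaT end up with NaN in each of them.
df["month_added"] = df["date_added"].dt.month
df["month_name_added"] = df["date_added"].dt.month_name()
df["year_added"] = df["date_added"].dt.year
print(df)
```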
In [29]:
In [30]:
Out[30]:
<AxesSubplot:xlabel='release_year', ylabel='count'>

As we can see, the majority of the movies and television shows on Netflix were released in the last decade, with only a few exceptions released earlier.

In [31]:
Out[31]:
Text(0.5, 1.0, 'Frequency of Movies which were released in different years and are available on Netflix')
In [32]:
Out[32]:
Text(0.5, 1.0, 'Frequency of TV shows which were released in different years and are available on Netflix')

What about a more interesting way to view how content is added across the year?

Sometimes visualizations should be eye-catching and attention-grabbing. I think this visual achieves that, even if it isn't the most precise.

By highlighting certain months, the reader's eye is drawn exactly where we want it.

In [33]:
In [34]:

Yes, December & January are definitely the best months for new content. Maybe Netflix knows that people have a lot of time off from work over this period and that it is a good time to reel people in?

February is the worst - why might this be? Ideas welcomed!

In [35]:
United States
Bulgaria, United States, Spain, Canada
Chile
Spain
United Kingdom
United States, India, South Korea, China
United States, United Kingdom, Denmark, Sweden

5.0 Study of Genres Correlations

Movie Genres

Let's now explore movie genres a little...
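
Since listed_in holds several comma-separated genres per title, a first step is to split it into a list and count each genre separately. This is a sketch on made-up rows, not necessarily how the plots in this section were built.

```python
import pandas as pd

# Hypothetical sample of the listed_in column.
df = pd.DataFrame({
    "listed_in": [
        "Children & Family Movies, Comedies",
        "Stand-Up Comedy",
        "Comedies",
    ]
})

# Split each string into a list of genres, one list per title.
df["genre"] = df["listed_in"].str.split(", ")

# explode() gives one row per (title, genre) pair, which can then be counted.
genre_counts = df["genre"].explode().value_counts()
print(genre_counts)
```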

In [36]:
In [37]:
<ipython-input-36-0c99ea335b94>:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

<ipython-input-36-0c99ea335b94>:16: DeprecationWarning:

`np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

There are 20 types in the Netflix Movie Dataset
In [38]:
Out[38]:
Text(0.5, 1.0, 'Top 10 Genres of Movies')
In [39]:
Out[39]:
Text(0.5, 1.0, 'Top 10 Genres of TV Shows')

6.0 Distribution of target audiences for each country

In [40]:
Out[40]:
country count
0 United States 2508
1 India 777
2 United Kingdom 348
3 Japan 176
4 Canada 141
In [41]:

Target Ages

Does Netflix uniformly target certain demographics? Or does this vary by country?
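
To compare countries, each rating can be mapped to a target age group and the share of each group computed per country. In this sketch the mapping lists only the ratings whose group is visible in the tables of this notebook; the remaining ratings would need their own entries, and the four sample rows are made up.

```python
import pandas as pd

# Rating -> target age group, limited to pairs visible in this notebook's tables.
TARGET_AGES = {
    "TV-PG": "Older Kids",
    "TV-Y7": "Older Kids",
    "TV-Y7-FV": "Older Kids",
    "TV-14": "Teens",
    "TV-MA": "Adults",
    "NR": "Adults",
}

# Hypothetical sample rows.
df = pd.DataFrame({
    "first_country": ["USA", "USA", "UK", "UK"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "TV-MA"],
})
df["target_ages"] = df["rating"].map(TARGET_AGES)

# Share of each target age group within each country (every row sums to 1).
shares = pd.crosstab(df["first_country"], df["target_ages"], normalize="index")
print(shares)
```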

In [42]:
In [43]:

Very interesting results.

It is also interesting to note similarities between culturally similar countries: the US & UK are closely aligned in their Netflix target ages, yet vastly different from, say, India or Japan!

6.1 Studying the gap between release and upload of content in different countries

Let's have a quick look at the lag between when content is released and when it is added to Netflix.

Spain looks to have a lot of new content. Great for them!
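
The lag can be sketched as the difference between year_added and release_year, averaged per country. The rows and the years_to_netflix column name below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "first_country": ["Spain", "Spain", "USA", "USA"],
    "release_year": [2018, 2019, 2010, 2016],
    "year_added": [2019, 2019, 2018, 2018],
})

# Years between a title's release and its arrival on Netflix.
df["years_to_netflix"] = df["year_added"] - df["release_year"]

# Average lag per country; a smaller value means newer content.
avg_lag = df.groupby("first_country")["years_to_netflix"].mean()
print(avg_lag)
```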

In [44]:
<ipython-input-44-a0f109f505c4>:7: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

What about TV shows...

In [45]:
<ipython-input-45-f90290b06fdf>:5: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

6.2 Comparing the Netflix content of USA & India

USA & India

Since these are the two largest content countries, it might be fun to compare them.

In [46]:
In [47]:
<ipython-input-47-bf67c766fbb8>:25: UserWarning:

FixedFormatter should only be used together with FixedLocator

So the USA dominates. Let me do this the other way around...

In [48]:
In [49]:

7.0 Word Cloud

We have taken the title column to build the word cloud. Instead of using a normal word cloud, we have built the cloud in the shape of the Netflix logo.

In [50]:

8.0 ML Classification Model

There are 14 rating labels, and we have very few rows of data, so building a classifier over all 14 labels would be very difficult.

Hence, we reduce the number of labels to 3: we keep the top two categories (TV-MA and TV-14) and group all the remaining ratings together as "other".
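
A minimal sketch of this relabelling. The numeric codes 1, 4 and 10 follow the mapping printed below; the sample ratings are made up.

```python
import pandas as pd

# Codes for the two most common ratings; every other rating shares OTHER.
TOP_TWO = {"TV-MA": 1, "TV-14": 4}
OTHER = 10

# Hypothetical sample of the rating column.
ratings = pd.Series(["TV-PG", "TV-MA", "TV-Y7-FV", "TV-14", "R"])

# Look each rating up in TOP_TWO and fall back to OTHER when it is absent.
rating_labels = ratings.map(lambda r: TOP_TWO.get(r, OTHER))
print(rating_labels.tolist())
```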

In [51]:
Out[51]:
{'TV-PG': 10,
 'TV-MA': 1,
 'TV-Y7-FV': 10,
 'TV-Y7': 10,
 'TV-14': 4,
 'R': 10,
 'TV-Y': 10,
 'NR': 10,
 'PG-13': 10,
 'TV-G': 10,
 'PG': 10,
 'G': 10,
 'UR': 10,
 'NC-17': 10}
In [52]:
Out[52]:
show_id type title country date_added release_year rating duration listed_in description count first_country target_ages genre month_added month_name_added year_added rating_labels
0 81145628 Movie Norm of the North: King Sized Adventure United States, India, South Korea, China 2019-09-09 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra... 1 USA Older Kids [Children & Family Movies, Comedies] 9.0 September 2019.0 10
1 80117401 Movie Jandino: Whatever it Takes United Kingdom 2016-09-09 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra... 1 UK Adults [Stand-Up Comedy] 9.0 September 2016.0 1
2 70234439 TV Show Transformers Prime United States 2018-09-08 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob... 1 USA Older Kids [Kids' TV] 9.0 September 2018.0 10
3 80058654 TV Show Transformers: Robots in Disguise United States 2018-09-08 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of... 1 USA Older Kids [Kids' TV] 9.0 September 2018.0 10
4 80125979 Movie #realityhigh United States 2017-09-08 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts... 1 USA Teens [Comedies] 9.0 September 2017.0 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6229 80000063 TV Show Red vs. Blue United States NaT 2015 NR 13 Seasons TV Action & Adventure, TV Comedies, TV Sci-Fi ... This parody of first-person shooter games, mil... 1 USA Adults [TV Action & Adventure, TV Comedies, TV Sci-Fi... NaN NaN NaN 10
6230 70286564 TV Show Maron United States NaT 2016 TV-MA 4 Seasons TV Comedies Marc Maron stars as Marc Maron, who interviews... 1 USA Adults [TV Comedies] NaN NaN NaN 1
6231 80116008 Movie Little Baby Bum: Nursery Rhyme Friends United States NaT 2016 TV-MA 60 min Movies Nursery rhymes and original music for children... 1 USA Adults [Movies] NaN NaN NaN 1
6232 70281022 TV Show A Young Doctor's Notebook and Other Stories United Kingdom NaT 2013 TV-MA 2 Seasons British TV Shows, TV Comedies, TV Dramas Set during the Russian Revolution, this comic ... 1 UK Adults [British TV Shows, TV Comedies, TV Dramas] NaN NaN NaN 1
6233 70153404 TV Show Friends United States NaT 2003 TV-14 10 Seasons Classic & Cult TV, TV Comedies This hit sitcom follows the merry misadventure... 1 USA Teens [Classic & Cult TV, TV Comedies] NaN NaN NaN 4

6234 rows × 18 columns

We will split the data manually because train_test_split, unless its stratify argument is used, does not guarantee that each category is split in the same proportion.
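
One way to split manually is to cut every label group at the same fraction and then recombine the pieces. The sketch below assumes an 80/20 split, which matches the shapes printed below (4987 and 1247 of 6234 rows), but the exact procedure used in the notebook is not shown, so treat this as an illustration on made-up rows.

```python
import pandas as pd

TRAIN_FRACTION = 0.8  # assumption: 80% train, 20% test

# Hypothetical data: 10 rows of label 1, 5 rows of label 4, 5 rows of label 10.
df = pd.DataFrame({
    "description": [f"text {i}" for i in range(20)],
    "rating_labels": [1] * 10 + [4] * 5 + [10] * 5,
})

train_parts, test_parts = [], []
for _, group in df.groupby("rating_labels"):
    # Cut each label group at the same fraction so proportions are preserved.
    cut = int(len(group) * TRAIN_FRACTION)
    train_parts.append(group.iloc[:cut])
    test_parts.append(group.iloc[cut:])

train_df = pd.concat(train_parts)
test_df = pd.concat(test_parts)
print(train_df.shape, test_df.shape)
```

Shuffling the rows before the split (for example with df.sample(frac=1, random_state=0)) would avoid any ordering effects in the source file.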

In [53]:
In [54]:
(4987,)
(1247,)
(4987,)
(1247,)

TfidfVectorizer has been used to convert the descriptions into arrays of numbers which can be used to train the Naive Bayes and Logistic Regression models.
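
The pipeline described here can be sketched with scikit-learn as follows. The texts and labels are made up, and the hyperparameters (English stop words, MultinomialNB, max_iter=1000) are assumptions, not necessarily the settings used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical descriptions with the three rating labels (1, 4, 10).
train_text = [
    "A gritty crime drama about a violent cartel war",
    "A dark thriller following a ruthless detective",
    "A teen comedy about high school friendships",
    "High school students form a band in this comedy",
    "Nursery rhymes and original music for children",
    "Animated robots protect children from danger",
]
train_labels = [1, 1, 4, 4, 10, 10]
test_text = ["A violent detective thriller", "Music and rhymes for children"]

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions[name] = model.predict(X_test)
    print(name, predictions[name])
```

Fitting the vectorizer on the training split alone keeps words that occur only in the test set from leaking into the features.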

In [55]:
Wall time: 2.99 ms
Out[55]:
0.6850239077080807
In [56]:
              precision    recall  f1-score   support

           1       0.52      0.28      0.36       407
           4       0.32      0.03      0.05       340
          10       0.45      0.91      0.60       500

    accuracy                           0.46      1247
   macro avg       0.43      0.40      0.34      1247
weighted avg       0.44      0.46      0.37      1247

Logistic Regression is a lot slower than Naive Bayes: 899 ms of wall time against 2.99 ms.

In [57]:
Wall time: 899 ms
Out[57]:
0.675359292316342
In [58]:
              precision    recall  f1-score   support

           1       0.49      0.41      0.45       407
           4       0.40      0.21      0.27       340
          10       0.52      0.77      0.62       500

    accuracy                           0.50      1247
   macro avg       0.47      0.46      0.45      1247
weighted avg       0.48      0.50      0.47      1247

9.0 Interpreting the results

We have tried to show different visualization techniques that can help keep the audience engaged throughout a presentation. Plots help us express our views better and make things easier for people to understand.